Three excel files from 2015, 2016, and 2017 (named candy_2015, candy_2016, and candy_2017 respectfully) showing survey answers to questions related to enjoyment of certain candies were read, joined, and cleaned before being written into a csv file which has been read below. All three questionnaires asked for respondent’s ages however only 2016 and 2017 questionnaires asked for their gender and for the country and state/province the respondent was from. Respondent’s must have been given the questionnaires to fill in themselves as spellings are incredibly varied (e.g. “US”, “u.s.a.”, “united states”) and there are some rather interesting responses, especially under age (e.g. “old enough”) and country (e.g. “’murrika!”). From first glance, the candy names that have been asked about are vast and range from the full name to a guess (e.g. “Tolberone something or other”), this may lead to there not being great discrepancies in the results. Further “full sized candy bars” seems rather generic so it has been ignored when considering some of the questions.
Loading appropriate libraries
library(tidyverse)
library(readxl)
library(stringr)
library(here)
Warning: package ‘here’ was built under R version 4.1.3
here() starts at C:/Users/mahri/OneDrive/CodeClan/dirty_data_project/dirty_data_codeclan_project_mahri/dirty_data_task_4_mahri
Reading the clean data csv file.
candy_cleaned <- read_csv("../clean_data/candy_cleaned.csv")
Rows: 9349 Columns: 96
-- Column specification --------------------------------------------------------------------------
Delimiter: ","
chr (93): trick_or_treating, country, state_or_prov, gender, butterfinger, x100_grand_bar, ano...
dbl (3): year, id_number, age
i Use `spec()` to retrieve the full column specification for this data.
i Specify the column types or set `show_col_types = FALSE` to quiet this message.
Using glimpse, head, and viewing the full files, I got an overview of the data, variables, types, and differences between the files.
head(candy_2015)
glimpse(candy_2016)
Rows: 1,259
Columns: 123
$ Timestamp <dttm> ~
$ `Are you going actually going trick or treating yourself?` <chr> ~
$ `Your gender:` <chr> ~
$ `How old are you?` <chr> ~
$ `Which country do you live in?` <chr> ~
$ `Which state, province, county do you live in?` <chr> ~
$ `[100 Grand Bar]` <chr> ~
$ `[Anonymous brown globs that come in black and orange wrappers]` <chr> ~
$ `[Any full-sized candy bar]` <chr> ~
$ `[Black Jacks]` <chr> ~
$ `[Bonkers (the candy)]` <chr> ~
$ `[Bonkers (the board game)]` <chr> ~
$ `[Bottle Caps]` <chr> ~
$ `[Box'o'Raisins]` <chr> ~
$ `[Broken glow stick]` <chr> ~
$ `[Butterfinger]` <chr> ~
$ `[Cadbury Creme Eggs]` <chr> ~
$ `[Candy Corn]` <chr> ~
$ `[Candy that is clearly just the stuff given out for free at restaurants]` <chr> ~
$ `[Caramellos]` <chr> ~
$ `[Cash, or other forms of legal tender]` <chr> ~
$ `[Chardonnay]` <chr> ~
$ `[Chick-o-Sticks (we don’t know what that is)]` <chr> ~
$ `[Chiclets]` <chr> ~
$ `[Coffee Crisp]` <chr> ~
$ `[Creepy Religious comics/Chick Tracts]` <chr> ~
$ `[Dental paraphenalia]` <chr> ~
$ `[Dots]` <chr> ~
$ `[Dove Bars]` <chr> ~
$ `[Fuzzy Peaches]` <chr> ~
$ `[Generic Brand Acetaminophen]` <chr> ~
$ `[Glow sticks]` <chr> ~
$ `[Goo Goo Clusters]` <chr> ~
$ `[Good N' Plenty]` <chr> ~
$ `[Gum from baseball cards]` <chr> ~
$ `[Gummy Bears straight up]` <chr> ~
$ `[Hard Candy]` <chr> ~
$ `[Healthy Fruit]` <chr> ~
$ `[Heath Bar]` <chr> ~
$ `[Hershey's Dark Chocolate]` <chr> ~
$ `[Hershey’s Milk Chocolate]` <chr> ~
$ `[Hershey's Kisses]` <chr> ~
$ `[Hugs (actual physical hugs)]` <chr> ~
$ `[Jolly Rancher (bad flavor)]` <chr> ~
$ `[Jolly Ranchers (good flavor)]` <chr> ~
$ `[JoyJoy (Mit Iodine!)]` <chr> ~
$ `[Junior Mints]` <chr> ~
$ `[Senior Mints]` <chr> ~
$ `[Kale smoothie]` <chr> ~
$ `[Kinder Happy Hippo]` <chr> ~
$ `[Kit Kat]` <chr> ~
$ `[LaffyTaffy]` <chr> ~
$ `[LemonHeads]` <chr> ~
$ `[Licorice (not black)]` <chr> ~
$ `[Licorice (yes black)]` <chr> ~
$ `[Lindt Truffle]` <chr> ~
$ `[Lollipops]` <chr> ~
$ `[Mars]` <chr> ~
$ `[Mary Janes]` <chr> ~
$ `[Maynards]` <chr> ~
$ `[Mike and Ike]` <chr> ~
$ `[Milk Duds]` <chr> ~
$ `[Milky Way]` <chr> ~
$ `[Regular M&Ms]` <chr> ~
$ `[Peanut M&M’s]` <chr> ~
$ `[Blue M&M's]` <chr> ~
$ `[Red M&M's]` <chr> ~
$ `[Third Party M&M's]` <chr> ~
$ `[Minibags of chips]` <chr> ~
$ `[Mint Kisses]` <chr> ~
$ `[Mint Juleps]` <chr> ~
$ `[Mr. Goodbar]` <chr> ~
$ `[Necco Wafers]` <chr> ~
$ `[Nerds]` <chr> ~
$ `[Nestle Crunch]` <chr> ~
$ `[Now'n'Laters]` <chr> ~
$ `[Peeps]` <chr> ~
$ `[Pencils]` <chr> ~
$ `[Person of Interest Season 3 DVD Box Set (not including Disc 4 with hilarious outtakes)]` <chr> ~
$ `[Pixy Stix]` <chr> ~
$ `[Reese’s Peanut Butter Cups]` <chr> ~
$ `[Reese's Pieces]` <chr> ~
$ `[Reggie Jackson Bar]` <chr> ~
$ `[Rolos]` <chr> ~
$ `[Skittles]` <chr> ~
$ `[Smarties (American)]` <chr> ~
$ `[Smarties (Commonwealth)]` <chr> ~
$ `[Snickers]` <chr> ~
$ `[Sourpatch Kids (i.e. abominations of nature)]` <chr> ~
$ `[Spotted Dick]` <chr> ~
$ `[Starburst]` <chr> ~
$ `[Sweet Tarts]` <chr> ~
$ `[Swedish Fish]` <chr> ~
$ `[Sweetums (a friend to diabetes)]` <chr> ~
$ `[Tic Tacs]` <chr> ~
$ `[Those odd marshmallow circus peanut things]` <chr> ~
$ `[Three Musketeers]` <chr> ~
$ `[Tolberone something or other]` <chr> ~
$ `[Trail Mix]` <chr> ~
$ `[Twix]` <chr> ~
$ `[Vials of pure high fructose corn syrup, for main-lining into your vein]` <chr> ~
$ `[Vicodin]` <chr> ~
$ `[Whatchamacallit Bars]` <chr> ~
$ `[White Bread]` <chr> ~
$ `[Whole Wheat anything]` <chr> ~
$ `[York Peppermint Patties]` <chr> ~
$ `Please list any items not included above that give you JOY.` <chr> ~
$ `Please list any items not included above that give you DESPAIR.` <chr> ~
$ `Please leave any witty, snarky or thoughtful remarks or comments regarding your choices.` <chr> ~
$ `Guess the number of mints in my hand.` <chr> ~
$ `Betty or Veronica?` <chr> ~
$ `"That dress* that went viral a few years back - when I first saw it, it was ________"` <chr> ~
$ `What is your favourite font?` <chr> ~
$ `Please estimate the degree(s) of separation you have from the following celebrities [JK Rowling]` <chr> ~
$ `Please estimate the degree(s) of separation you have from the following celebrities [JJ Abrams]` <chr> ~
$ `Please estimate the degree(s) of separation you have from the following celebrities [Beyoncé]` <chr> ~
$ `Please estimate the degree(s) of separation you have from the following celebrities [Bieber]` <chr> ~
$ `Please estimate the degree(s) of separation you have from the following celebrities [Kevin Bacon]` <chr> ~
$ `Please estimate the degree(s) of separation you have from the following celebrities [Francis Bacon (1561 - 1626)]` <chr> ~
$ `Which day do you prefer, Friday or Sunday?` <chr> ~
$ `Do you eat apples the correct way, East to West (side to side) or do you eat them like a freak of nature, South to North (bottom to top)?` <chr> ~
$ `When you see the above image of the 4 different websites, which one would you most likely check out (please be honest).` <chr> ~
$ `[York Peppermint Patties] Ignore` <lgl> ~
glimpse(candy_2017)
Rows: 2,460
Columns: 120
$ `Internal ID` <dbl> ~
$ `Q1: GOING OUT?` <chr> ~
$ `Q2: GENDER` <chr> ~
$ `Q3: AGE` <chr> ~
$ `Q4: COUNTRY` <chr> ~
$ `Q5: STATE, PROVINCE, COUNTY, ETC` <chr> ~
$ `Q6 | 100 Grand Bar` <chr> ~
$ `Q6 | Anonymous brown globs that come in black and orange wrappers\t(a.k.a. Mary Janes)` <chr> ~
$ `Q6 | Any full-sized candy bar` <chr> ~
$ `Q6 | Black Jacks` <chr> ~
$ `Q6 | Bonkers (the candy)` <chr> ~
$ `Q6 | Bonkers (the board game)` <chr> ~
$ `Q6 | Bottle Caps` <chr> ~
$ `Q6 | Box'o'Raisins` <chr> ~
$ `Q6 | Broken glow stick` <chr> ~
$ `Q6 | Butterfinger` <chr> ~
$ `Q6 | Cadbury Creme Eggs` <chr> ~
$ `Q6 | Candy Corn` <chr> ~
$ `Q6 | Candy that is clearly just the stuff given out for free at restaurants` <chr> ~
$ `Q6 | Caramellos` <chr> ~
$ `Q6 | Cash, or other forms of legal tender` <chr> ~
$ `Q6 | Chardonnay` <chr> ~
$ `Q6 | Chick-o-Sticks (we don’t know what that is)` <chr> ~
$ `Q6 | Chiclets` <chr> ~
$ `Q6 | Coffee Crisp` <chr> ~
$ `Q6 | Creepy Religious comics/Chick Tracts` <chr> ~
$ `Q6 | Dental paraphenalia` <chr> ~
$ `Q6 | Dots` <chr> ~
$ `Q6 | Dove Bars` <chr> ~
$ `Q6 | Fuzzy Peaches` <chr> ~
$ `Q6 | Generic Brand Acetaminophen` <chr> ~
$ `Q6 | Glow sticks` <chr> ~
$ `Q6 | Goo Goo Clusters` <chr> ~
$ `Q6 | Good N' Plenty` <chr> ~
$ `Q6 | Gum from baseball cards` <chr> ~
$ `Q6 | Gummy Bears straight up` <chr> ~
$ `Q6 | Hard Candy` <chr> ~
$ `Q6 | Healthy Fruit` <chr> ~
$ `Q6 | Heath Bar` <chr> ~
$ `Q6 | Hershey's Dark Chocolate` <chr> ~
$ `Q6 | Hershey’s Milk Chocolate` <chr> ~
$ `Q6 | Hershey's Kisses` <chr> ~
$ `Q6 | Hugs (actual physical hugs)` <chr> ~
$ `Q6 | Jolly Rancher (bad flavor)` <chr> ~
$ `Q6 | Jolly Ranchers (good flavor)` <chr> ~
$ `Q6 | JoyJoy (Mit Iodine!)` <chr> ~
$ `Q6 | Junior Mints` <chr> ~
$ `Q6 | Senior Mints` <chr> ~
$ `Q6 | Kale smoothie` <chr> ~
$ `Q6 | Kinder Happy Hippo` <chr> ~
$ `Q6 | Kit Kat` <chr> ~
$ `Q6 | LaffyTaffy` <chr> ~
$ `Q6 | LemonHeads` <chr> ~
$ `Q6 | Licorice (not black)` <chr> ~
$ `Q6 | Licorice (yes black)` <chr> ~
$ `Q6 | Lindt Truffle` <chr> ~
$ `Q6 | Lollipops` <chr> ~
$ `Q6 | Mars` <chr> ~
$ `Q6 | Maynards` <chr> ~
$ `Q6 | Mike and Ike` <chr> ~
$ `Q6 | Milk Duds` <chr> ~
$ `Q6 | Milky Way` <chr> ~
$ `Q6 | Regular M&Ms` <chr> ~
$ `Q6 | Peanut M&M’s` <chr> ~
$ `Q6 | Blue M&M's` <chr> ~
$ `Q6 | Red M&M's` <chr> ~
$ `Q6 | Green Party M&M's` <chr> ~
$ `Q6 | Independent M&M's` <chr> ~
$ `Q6 | Abstained from M&M'ing.` <chr> ~
$ `Q6 | Minibags of chips` <chr> ~
$ `Q6 | Mint Kisses` <chr> ~
$ `Q6 | Mint Juleps` <chr> ~
$ `Q6 | Mr. Goodbar` <chr> ~
$ `Q6 | Necco Wafers` <chr> ~
$ `Q6 | Nerds` <chr> ~
$ `Q6 | Nestle Crunch` <chr> ~
$ `Q6 | Now'n'Laters` <chr> ~
$ `Q6 | Peeps` <chr> ~
$ `Q6 | Pencils` <chr> ~
$ `Q6 | Pixy Stix` <chr> ~
$ `Q6 | Real Housewives of Orange County Season 9 Blue-Ray` <chr> ~
$ `Q6 | Reese’s Peanut Butter Cups` <chr> ~
$ `Q6 | Reese's Pieces` <chr> ~
$ `Q6 | Reggie Jackson Bar` <chr> ~
$ `Q6 | Rolos` <chr> ~
$ `Q6 | Sandwich-sized bags filled with BooBerry Crunch` <chr> ~
$ `Q6 | Skittles` <chr> ~
$ `Q6 | Smarties (American)` <chr> ~
$ `Q6 | Smarties (Commonwealth)` <chr> ~
$ `Q6 | Snickers` <chr> ~
$ `Q6 | Sourpatch Kids (i.e. abominations of nature)` <chr> ~
$ `Q6 | Spotted Dick` <chr> ~
$ `Q6 | Starburst` <chr> ~
$ `Q6 | Sweet Tarts` <chr> ~
$ `Q6 | Swedish Fish` <chr> ~
$ `Q6 | Sweetums (a friend to diabetes)` <chr> ~
$ `Q6 | Take 5` <chr> ~
$ `Q6 | Tic Tacs` <chr> ~
$ `Q6 | Those odd marshmallow circus peanut things` <chr> ~
$ `Q6 | Three Musketeers` <chr> ~
$ `Q6 | Tolberone something or other` <chr> ~
$ `Q6 | Trail Mix` <chr> ~
$ `Q6 | Twix` <chr> ~
$ `Q6 | Vials of pure high fructose corn syrup, for main-lining into your vein` <chr> ~
$ `Q6 | Vicodin` <chr> ~
$ `Q6 | Whatchamacallit Bars` <chr> ~
$ `Q6 | White Bread` <chr> ~
$ `Q6 | Whole Wheat anything` <chr> ~
$ `Q6 | York Peppermint Patties` <chr> ~
$ `Q7: JOY OTHER` <chr> ~
$ `Q8: DESPAIR OTHER` <chr> ~
$ `Q9: OTHER COMMENTS` <chr> ~
$ `Q10: DRESS` <chr> ~
$ ...114 <chr> ~
$ `Q11: DAY` <chr> ~
$ `Q12: MEDIA [Daily Dish]` <dbl> ~
$ `Q12: MEDIA [Science]` <dbl> ~
$ `Q12: MEDIA [ESPN]` <dbl> ~
$ `Q12: MEDIA [Yahoo]` <dbl> ~
$ `Click Coordinates (x, y)` <chr> ~
Using Janitor to tidy up column names, bringing all to lower case and separating words with _
library(janitor)
janitor_candy_2015 <- janitor::clean_names(candy_2015)
janitor_candy_2015
janitor_candy_2016 <- janitor::clean_names(candy_2016)
janitor_candy_2016
janitor_candy_2017 <- janitor::clean_names(candy_2017)
janitor_candy_2017
Some columns took my interest due to confusion about what they might be about or whether people tended to have a certain opinion from first glance. For example, I was curious as to who was reporting back about these candies:
janitor_candy_2017 %>%
select(q3_age, q2_gender, q6_independent_m_ms, q6_green_party_m_ms)
# all types of people
janitor_candy_2016 %>%
select(york_peppermint_patties_ignore)
#nobody
####REMOVE AND RENAME
Step 1 - Remove and add for each year (2015, 2016, 2017)
Step 2 - RENAME FOR EACH YEAR
2015 REMOVE AND ADD
# names(janitor_candy_2015)
col_removed_candy_2015 <- janitor_candy_2015 %>%
select(-c(116:124), -c(97:113), -c(93:95), -c(90, 91),
-c(peterson_brand_sidewalk_chalk, spotted_dick, mint_leaves,
joy_joy_mit_iodine, minibags_of_chips, lapel_pins, kale_smoothie,
hugs_actual_physical_hugs, heath_bar, healthy_fruit,
creepy_religious_comics_chick_tracts, broken_glow_stick,
glow_sticks, generic_brand_acetaminophen, dental_paraphenalia,
cash_or_other_forms_of_legal_tender,
vials_of_pure_high_fructose_corn_syrup_for_main_lining_into_your_vein,
box_o_raisins, timestamp)) %>%
add_column(year = "2015", .before = 1) %>%
mutate(id_number = row_number(), .before = 2)
col_removed_candy_2015
#view(col_removed_candy_2015)
2015 RENAME
# Looking at differences between responses for Mary Janes ratings
# col_removed_candy_2015 %>%
# select(anonymous_brown_globs_that_come_in_black_and_orange_wrappers, mary_janes)
candy_2015_renamed <- col_removed_candy_2015 %>%
rename(age = how_old_are_you,
trick_or_treating = are_you_going_actually_going_trick_or_treating_yourself,
anonymous_black_and_orange_wrapper =
anonymous_brown_globs_that_come_in_black_and_orange_wrappers,
brach_not_including_candy_corn = brach_products_not_including_candy_corn,
restaurant_candy =
candy_that_is_clearly_just_the_stuff_given_out_for_free_at_restaurants,
hersheys_dark_chocolate = dark_chocolate_hershey,
gummy_bears = gummy_bears_straight_up,
hersheys_kissables = hershey_s_kissables,
hersheys_milk_chocolate = hershey_s_milk_chocolate,
licorice_black = licorice,
reeses_peanut_butter_cups = reese_s_peanut_butter_cups,
toblerone = tolberone_something_or_other,
peanut_m_ms = peanut_m_m_s,
chick_o_stick = chick_o_sticks_we_don_t_know_what_that_is,
circus_peanuts = those_odd_marshmallow_circus_peanut_things,
sea_salt_chocolate =
sea_salt_flavored_stuff_probably_chocolate_since_this_is_the_it_flavor_of_the_year)
2016 - REMOVE AND ADD
col_removed_candy_2016 <- janitor_candy_2016 %>%
select(-c(104, 105, 107:123),
-c(vicodin, vials_of_pure_high_fructose_corn_syrup_for_main_lining_into_your_vein,
trail_mix, spotted_dick,
person_of_interest_season_3_dvd_box_set_not_including_disc_4_with_hilarious_outtakes,
minibags_of_chips, kale_smoothie, joy_joy_mit_iodine, hugs_actual_physical_hugs,
heath_bar, healthy_fruit, glow_sticks, generic_brand_acetaminophen,
dental_paraphenalia, creepy_religious_comics_chick_tracts, chardonnay,
cash_or_other_forms_of_legal_tender, broken_glow_stick, boxo_raisins,
bonkers_the_board_game, timestamp)) %>%
add_column(year = "2016", .before = 1) %>%
mutate(id_number = max(candy_2015_renamed$id_number) + row_number(), .before = 2)
col_removed_candy_2016
2016 RENAME
# Again checking on Mary Janes column differences
# col_removed_candy_2016 %>%
# select(anonymous_brown_globs_that_come_in_black_and_orange_wrappers, mary_janes)
candy_2016_renamed <- col_removed_candy_2016 %>%
rename(trick_or_treating =
are_you_going_actually_going_trick_or_treating_yourself,
gender = your_gender,
age = how_old_are_you,
country = which_country_do_you_live_in,
state_or_prov = which_state_province_county_do_you_live_in,
anonymous_black_and_orange_wrapper =
anonymous_brown_globs_that_come_in_black_and_orange_wrappers,
bonkers = bonkers_the_candy,
restaurant_candy =
candy_that_is_clearly_just_the_stuff_given_out_for_free_at_restaurants,
chick_o_stick = chick_o_sticks_we_don_t_know_what_that_is,
gummy_bears = gummy_bears_straight_up,
hersheys_milk_chocolate = hershey_s_milk_chocolate,
licorice_black = licorice_yes_black,
peanut_m_ms = peanut_m_m_s,
party_bag_m_ms = third_party_m_ms,
reeses_peanut_butter_cups = reese_s_peanut_butter_cups,
sourpatch_kids = sourpatch_kids_i_e_abominations_of_nature,
sweetarts = sweet_tarts,
sweetums = sweetums_a_friend_to_diabetes,
circus_peanuts = those_odd_marshmallow_circus_peanut_things,
toblerone = tolberone_something_or_other)
candy_2016_renamed
NA
2017 CLEAN AND ADD Note - should have removed the “q6_” before this, but have done so in the next chunk
col_removed_candy_2017 <- janitor_candy_2017 %>%
select(-c(102, 104, 105, 107, 108, 110:120),
-c(q6_spotted_dick,
q6_sandwich_sized_bags_filled_with_boo_berry_crunch,
q6_real_housewives_of_orange_county_season_9_blue_ray,
q6_minibags_of_chips,
q6_abstained_from_m_ming,
q6_kale_smoothie, q6_joy_joy_mit_iodine,
q6_hugs_actual_physical_hugs,
q6_heath_bar,
q6_healthy_fruit,
q6_glow_sticks,
q6_generic_brand_acetaminophen,
q6_dental_paraphenalia,
q6_creepy_religious_comics_chick_tracts,
q6_chardonnay,
q6_cash_or_other_forms_of_legal_tender,
q6_broken_glow_stick,
q6_boxo_raisins,
q6_bonkers_the_board_game,
internal_id)) %>%
add_column(year = "2017", .before = 1) %>%
mutate(id_number = max(candy_2016_renamed$id_number) + row_number(), .before = 2)
col_removed_candy_2017
2017 RENAME Remove “q1/2/3/4/5/6” from the start of column names and rename to match 2015 and 16
candy_2017_q_removed <- col_removed_candy_2017 %>%
rename_all(~ sub("^[q0-9]{2}_", "",
make.names(names(col_removed_candy_2017))))
candy_2017_renamed <- candy_2017_q_removed %>%
rename(trick_or_treating = going_out,
state_or_prov = state_province_county_etc,
x100_grand_bar = `100_grand_bar`,
mary_janes =
anonymous_brown_globs_that_come_in_black_and_orange_wrappers_a_k_a_mary_janes,
bonkers = bonkers_the_candy,
restaurant_candy =
candy_that_is_clearly_just_the_stuff_given_out_for_free_at_restaurants,
chick_o_stick = chick_o_sticks_we_don_t_know_what_that_is,
gummy_bears = gummy_bears_straight_up,
hersheys_milk_chocolate = hershey_s_milk_chocolate,
licorice_black = licorice_yes_black,
peanut_m_ms = peanut_m_m_s,
green_m_ms = green_party_m_ms,
lone_m_ms = independent_m_ms,
reeses_peanut_butter_cups = reese_s_peanut_butter_cups,
sourpatch_kids = sourpatch_kids_i_e_abominations_of_nature,
sweetarts = sweet_tarts,
sweetums = sweetums_a_friend_to_diabetes,
circus_peanuts = those_odd_marshmallow_circus_peanut_things,
toblerone = tolberone_something_or_other)
candy_2017_renamed
Getting an idea of people’s responses:
distinct(candy_2015_renamed, age)
#(chr) 146 responses, some silly and some strange
distinct(candy_2015_renamed, trick_or_treating)
#(chr) yes or no (NAs in 2017)
distinct(candy_2015_renamed, starburst)
distinct(candy_2016_renamed, age)
# (chr) 98 incl silly/strange
distinct(candy_2016_renamed, trick_or_treating)
#(chr) Yes No (NAs in 2017)
distinct(candy_2016_renamed, gender)
# Male, Female, Other, I'd rather not say, NA
distinct(candy_2016_renamed, country)
#93 some silly, some e.g. USA, US, us, u.s.a. etc
distinct(candy_2016_renamed, starburst)
distinct(candy_2017_renamed, age)
#(chr) 107 incl silly/strange
distinct(candy_2017_renamed, trick_or_treating)
#(chr) Yes No and NA
distinct(candy_2017_renamed, gender)
#Male, Female, Other, I'd rather not say, NA
distinct(candy_2017_renamed, country)
#118 some silly, some e.g. USA, US, us, u.s.a. etc
distinct(candy_2017_renamed, starburst)
####JOINING
Joining all three years by binding rows so as to keep everything
bound_candy <- bind_rows(candy_2015_renamed,
candy_2016_renamed,
candy_2017_renamed)
bound_candy <- bound_candy %>%
relocate(country, .before = 5) %>%
relocate(state_or_prov, .before = 6) %>%
relocate(gender, .before = 7)
# view(bound_candy)
##QUESTIONS (with additional cleaning in amongst my answers)
####QUESTION 1 What is the total number of candy ratings given across the three years. (Number of candy ratings, not the number of raters. Don’t count missing values)
ANSWER The total number of candy ratings across the three years was 590,010 ratings
total_ratings <- bound_candy %>%
select(-c(year, id_number, age,
trick_or_treating, country,
state_or_prov, gender)) %>%
pivot_longer(butterfinger:take_5,
names_to = "candy",
values_to = "rating") %>%
filter(!is.na(rating)) %>%
count(n())
total_ratings
NA
####AGE CLEANING
Age is a character column with 274 distinct values. They are a mix of numbers, NAs, and strange and silly values. * I changed the “age” column to a numeric but it output strange figures, so I specified “as.integer” and the strange answers became NAs. * It seems unlikely that anyone over 100 years old is taking part, so I removed them.
# bound_candy %>%
# distinct(age)
bound_age_to_numeric <- bound_candy %>%
mutate(age = as.integer(age)) %>%
arrange(age)
Warning: Problem while computing `age = as.integer(age)`.
i NAs introduced by coercion
Warning: Problem while computing `age = as.integer(age)`.
i NAs introduced by coercion to integer range
# bound_age_to_numeric %>%
# distinct(age)
bound_age_cleaning <- bound_age_to_numeric %>%
mutate(age = if_else(age > 99, NA_integer_, age))
bound_age_cleaning
# view(bound_age_cleaning)
####QUESTION TWO What was the average age of people who are going out trick or treating?
ANSWER The (rounded) average age of those going trick or treating is 35 years old (34.94897 is the un-rounded value)
# bound_age_cleaning %>%
# distinct(trick_or_treating)
bound_age_cleaning %>%
select(age, trick_or_treating) %>%
group_by(trick_or_treating) %>%
summarise(average_age = round(mean(age, na.rm = TRUE)))
# to just get the answer for Yes on it's own:
bound_age_cleaning %>%
select(age, trick_or_treating) %>%
group_by(trick_or_treating) %>%
filter(trick_or_treating == "Yes") %>%
summarise(average_age = round(mean(age, na.rm = TRUE)))
NA
####QUESTION THREE What was the average age of people who are not going trick or treating?
ANSWER The (rounded) average age of those not going trick or treating is 39 years old. (39.10454 is the unrounded value)
bound_age_cleaning %>%
select(age, trick_or_treating) %>%
group_by(trick_or_treating) %>%
summarise(average_age = round(mean(age, na.rm = TRUE)))
# to just get the answer for No on it's own:
bound_age_cleaning %>%
select(age, trick_or_treating) %>%
group_by(trick_or_treating) %>%
filter(trick_or_treating == "No") %>%
summarise(average_age = round(mean(age, na.rm = TRUE)))
NA
####QUESTION FOUR For each of joy, despair and meh, which candy bar revived the most of these ratings?
ANSWER
* Despair: gum that comes with baseball cards returned the most despair
responses with 7,341 * Joy: Full sized candy bars made the most people
joyful with 7,589 responses. This seems very generic so I ran it again
to remove the full sized candy bars and the top Joy response was: 7369
responses for reese’s peanut butter cups * Meh: 1,570 “Meh” responses
were given for lollipops
rated_candy <- bound_candy %>%
select(-c(year, id_number, age,
trick_or_treating, country,
state_or_prov, gender,
any_full_sized_candy_bar)) %>%
pivot_longer(butterfinger:take_5,
names_to = "candy",
values_to = "rating")
rated_candy
rated_candy %>%
group_by(rating, candy) %>%
summarise(number_of_ratings = n()) %>%
filter(number_of_ratings == max(number_of_ratings))
`summarise()` has grouped output by 'rating'. You can override using the `.groups` argument.
####QUESTION FIVE How many people rated Starburst as despair?
ANSWER 1990 people voted despair for starbursts.
rated_candy %>%
filter(candy == "starburst") %>%
group_by(rating) %>%
summarise(rating_count = n())
NA
———-QUESTIONS SIX, SEVEN, AND EIGHT AFTER COUNTRY CLEAN————-
####COUNTRY CLEANING
Firstly getting an idea of NAs and distinct country values:
bound_candy %>%
filter(is.na(country))
bound_candy %>%
distinct(country)
Using stringr and regex to reduce “country” values
This started out a lot longer, and I am sure there are faster/ better ways of doing this, but I was practicing different options (If it were shorter - I am not sure about the difficulties that may arise if other countries were added at a later date?)
For columns that were numbers, inputs like “N. America”, or clearly fake (silly) answers, I checked the state or province column to see if there was a match to a country
I checked my work one row at a time to try and ensure I did not change anything that was not meant to be changed.
Tried to be careful when changing numbers as some = States according to their state/province column and some do not equal anything.
Accidentally changed some words like A(a)ustralia to A(a)States, and Austria to Statesa… which is ok as I am just including them as “other world”
There are issues with “Statesof A”(there are 4), “the best one -States”(1), “TheStates”(2), “The States” (1), and a couple of others but as there are so few so I have moved on
Cascadia includes Canada and States so I left it as is
United Kingdom is inclusive of “Endland”, “England”, “Scotland” and variations of “UK”
Those under “Ireland” are all in the Republic
There are some more notes on specific pieces of code within the code chunk
Vectors were created for Unknown countries and other countries of the world to change as many at once.
bound_country_clean <- bound_age_cleaning %>%
mutate(country = str_replace_all(country, pattern = "[0-9][0-9][.][0-9]", "States"),
country = str_replace_all(country, pattern = "[3|4|5][0-9]", "States"),
country = str_replace_all(country, pattern = "[uU][a-zA-Z.]* [sS][a-zA-Z]* [oO][fF] [Aa]*[a-zA-Z]*", "States"),
country = str_replace_all(country, pattern = "[ .]*[uU]+[a-zA-Z.]* [sS]+[a-zA-Z]*[ .]*", "States"),
country = str_replace_all(country, pattern = "[ ]*[uU]+[ .!]*[sS]+[ .!]*[aA]*[ .!]*", "States"),
# the above changes austria and australia - watch out!
country = str_replace_all(country, pattern = "[uU]+[nN]+[a-z]* [sS]+[tT][aA][tT][a-z]* [oO]+[fF] [aA][mM][a-z]*\t*", "States"),
# there's still on "United States of America" which won't go - spaces?
country = str_replace_all(country, pattern = "^[mM][uU|eE][rR]+[iI][cC|kK][aA]", "States"),
country = str_replace_all(country, pattern = "^[aA][mM][eE][rR][iI][cC][aA]", "States"),
country = str_replace_all(country, pattern = "^\\'[mM][uU|eE][rR][iI][cC][aA]", "States"),
country = str_replace_all(country, pattern = "[Tthe]*[ ]*[sS]+[tT][aA][tT][a-z]*[Ss][tT][a-z]*[Ss]*[tT]*[a-z]*[Ss]*[tT]*[a-z]*", "States"),
# removing States repeated
country = str_replace_all(country, pattern = "[Tt][a-z]* [bB][a-z]* [a-zA-Z]+ [-]+ [uU][sS][aA]", "States"),
# this was an attempt at "the best one - USA"
country = str_replace_all(country, pattern = "[nN][a-z]+ [cC|yY|jJ]+[a-z]*|[pP|tT]+[iI|rR]+[tT|uU]+[a-z]+|[cC][aA][lL][iI][a-z]*|[aA][lL][aA][a-z]*", "States"),
# The above is for "New York, North Carolina, New Jersey, Pittsburgh, Trumpistan, California, and Alaska (it also changed the end of New Zealand to New Zestates...)
country = str_replace_all(country, pattern = "Ahem....Amerca", "States"),
country = str_replace_all(country, pattern = "^[cC]+[aA]+[nN]+[aA]+[dD|eE]+[aA]*[aA|iI|rR]*[aA|nN]*[iI]*[aA]*[`]*|[s][o][v][a-z]* [cC][a-z]*", "Canada"),
country = str_replace_all(country, pattern = "^[eE]+[nN]+[a-zA-Z]*|^[sS]+[cC]+[a-zA-Z]*|[uU]+[nN]+[a-zA-Z]* [kK]+[a-zA-Z]*", "United Kingdom"),
country = str_replace_all(country, pattern = "^[eE]+[nN]+[a-zA-Z]*|^[sS]+[cC]+[a-zA-Z]*|[uU]+[nN]+[a-zA-Z]* [kK]+[.]*[a-zA-Z]*|[uU]+[.]*[kK]+[.]*", "United Kingdom")
)
bound_country_clean %>%
distinct(country)
# view(bound_country_clean)
Creating the “other” vectors and joining them to the data
rest_of_the_world <- c("Germany", "germany", "AStates", "aStates", "Statesa",
"Japan", "Mexico", "Netherlands", "netherlands", "The Netherlands",
"Sweden", "Belgium", "Ireland", "New ZeStates", "Switzerland",
"China", "France", "france", "Denmark", "Korea", "South Korea",
"Brasil", "cascadia", "Cascadia", "Costa Rica", "croatia",
"españa", "spain", "South africa", "Europe", "Finland",
"finland", "Greece", "hong kong", "Hong Kong", "hungary",
"Iceland", "Indonesia", "kenya", "Not theStatesor Canada",
"Panama", "Philippines", "Portugal", "Singapore", "sweden",
"Taiwan", "UAE")
unknown_country <- c("Can", "Statesof A","Sub-Canadian North America...", "'Merica",
"TheStates", "1", "A", "A tropical island south of the equator",
"Atlantis", "Denial", "Earth", "EUA", "Fear and Loathing",
"I don't know anymore", "god's country", "insanity lately",
"N. America", "I pretend to be from Canada, but I am really
from theStates", "Narnia", "Neverland", "one of the best ones",
"See above", "Somewhere", "States(I think but it's an election
year so who can really tell)", "States? Hard to tell anymore.",
"Statess", "subscribe to dm4uz3 on youtube", "the best one -States",
"The republic of Cascadia", "The States", "The Yoo Ess of Aaayyyyyy",
"there isn't one for old men", "this one", "UD",
"United States of America")
candy_cleaned <- bound_country_clean %>%
mutate(country = if_else(country %in% rest_of_the_world, "Other", country),
country = if_else(country %in% unknown_country, as.character(NA), country))
candy_cleaned %>%
distinct(country)
# still have 11 different "countries"
For the next three questions, count despair as -1, joy as +1, and meh as 0.
candy_pivot_ratings <- candy_cleaned %>%
select(-c(id_number,
age,
trick_or_treating,
state_or_prov)) %>%
pivot_longer(butterfinger:take_5,
names_to = "candy",
values_to = "rating") %>%
mutate(ratings_numeric = case_when(rating == "JOY" ~ 1,
rating == "DESPAIR" ~ -1,
rating == "MEH" ~ 0)
)
candy_pivot_ratings
####QUESTION SIX What was the most popular candy bar by this rating system for each gender in the dataset?
ANSWER The most popular candy bar for all four groups was “any full sized candy bar”. But, as decided in a previous question, this seems too generic so I would suggest the following top results: * “Female”: Reese’s Peanut Butter Cups. * “Male”: also Reese’s Peanut Butter Cups. * “Other”: Twix * “I’d rather not say”: Kit-Kat
candy_pivot_ratings %>%
filter(gender == "Female") %>%
group_by(candy) %>%
summarise(rating_count = mean(ratings_numeric, na.rm = TRUE)) %>%
arrange(desc(rating_count))
candy_pivot_ratings %>%
group_by(gender, candy) %>%
summarise(rating_count = mean(ratings_numeric, na.rm = TRUE)) %>%
arrange(desc(rating_count))
`summarise()` has grouped output by 'gender'. You can override using the `.groups` argument.
####QUESTION SEVEN What was the most popular candy bar in each year?
ANSWER * Again, the same “full sized candy bar” arguement re: it’s too generic means that the most popular candy bar in each year was: * 2015: Reese’s Peanut Butter Cups. * 2016: Kit-Kat. * 2017: Reese’s Peanut Butter Cups.
candy_pivot_ratings %>%
group_by(year, candy) %>%
summarise(rating_count = mean(ratings_numeric, na.rm = TRUE)) %>%
arrange(desc(rating_count))
`summarise()` has grouped output by 'year'. You can override using the `.groups` argument.
NA
####QUESTION EIGHT What was the most popular candy bar by this rating for people in US, Canada, UK, and all other countries?
ANSWER “Any full sized candy bar” was ignored again due to being too generic, therefore each countries most popular candy was:
United States: Reese’s peanut butter cups Canada: Kit-Kat UK: Rolos The rest of the world: Kit-Kat
candy_pivot_ratings %>%
filter(country == "States") %>%
group_by(candy) %>%
summarise(rating_count = mean(ratings_numeric, na.rm = TRUE)) %>%
arrange(desc(rating_count))
candy_pivot_ratings %>%
filter(country == "Canada") %>%
group_by(candy) %>%
summarise(rating_count = mean(ratings_numeric, na.rm = TRUE)) %>%
arrange(desc(rating_count))
candy_pivot_ratings %>%
filter(country == "United Kingdom") %>%
group_by(candy) %>%
summarise(rating_count = mean(ratings_numeric, na.rm = TRUE)) %>%
arrange(desc(rating_count))
candy_pivot_ratings %>%
filter(country == "Other") %>%
group_by(candy) %>%
summarise(rating_count = mean(ratings_numeric, na.rm = TRUE)) %>%
arrange(desc(rating_count))